1 Instructor names and contact information
- Benjamin P. Berman (Benjamin.Berman at cshs.org)
- Tiago C. Silva (tiagochst at gmail.com)
2 Workshop description
2.1 Aims
This workshop will:
- Introduce you to the TCGA (The Cancer Genome Atlas) data available at the NCI’s Genomic Data Commons (GDC).
- Demonstrate how to access and import TCGA data using the R/Bioconductor package TCGAbiolinks (Colaprico et al. 2015)
- Provide instruction to visualize copy number alteration and mutation data using the R/Bioconductor package maftools
2.2 Useful links
- the NCI’s Genomic Data Commons (GDC): https://gdc.cancer.gov/.
- TCGAbiolinks package: http://bioconductor.org/packages/TCGAbiolinks/
- maftools package: http://bioconductor.org/packages/maftools/
2.3 Reference articles
- TCGAbiolinks:
- Colaprico, Antonio, et al. “TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.” Nucleic acids research 44.8 (2015): e71-e71. Silva, Tiago C., et al. “TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages.” F1000Research 5 (2016). (https://f1000research.com/articles/5-1542/v2)
- Mounir, Mohamed, et al. “New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx.” PLoS computational biology 15.3 (2019): e1006701. (https://doi.org/10.1371/journal.pcbi.1006701)
- Silva TC, Colaprico A, Olsen C et al.TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages [version 2; peer review: 1 approved, 2 approved with reservations]. F1000Research 2016, 5:1542 (https://doi.org/10.12688/f1000research.8923.2).
- The NCI’s Genomic Data Commons (GDC):
- Gao, Galen F., et al. “Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons’ Data.” Cell systems 9.1 (2019): 24-34. (https://doi.org/10.1016/j.cels.2019.06.006)
- Maftools:
- Mayakonda A, Lin D, Assenov Y, Plass C, Koeffler PH (2018). “Maftools: efficient and comprehensive analysis of somatic variants in cancer.” Genome Research. doi: 10.1101/gr.239244.118.
- Complexheatmap:
- Gu Z, Eils R, Schlesner M (2016). “Complex heatmaps reveal patterns and correlations in multidimensional genomic data.” Bioinformatics.
2.4 R/Bioconductor packages used
2.5 Pre-requisites
- Basic knowledge of R syntax
- Understand the pipe operator (“%>%”) (help material https://r4ds.had.co.nz/pipes.html)
- Understand the SummarizedExperiment data structure (help material http://bioconductor.org/packages/SummarizedExperiment/)
3 Introduction
3.1 Understanding GDC (genomics data common portal)
In this workshop we will access (The Cancer Genome Atlas) data available at the NCI’s Genomic Data Commons (GDC). Before we start, it is important to know that the data is deposit in two different databases: - The legacy archive (https://portal.gdc.cancer.gov/legacy-archive/search/f) which contains unharmonized legacy data from repositories that predate the GDC (e.g. CGHub). Also, legacy data is not actively maintained, processed, or harmonized by the GDC (Source: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Legacy_Archive/)
- Harmonized database (https://portal.gdc.cancer.gov/) which contains data from many cancer projects processed using standardized pipelines and the reference genome GRCh38 (hg38). This gives the advantage of analyzing multiple cancer types or the same cancer type across multiple projects. You can find more information about the pipelines supporting data harmonization at https://gdc.cancer.gov/about-data/gdc-data-harmonization and at https://docs.gdc.cancer.gov/Encyclopedia/pages/Harmonized_Data/.
An overview of the web portal is available at https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/.
Figure: TCGA databases (blue: legacy archive. red: harmonized database) Source: https://portal.gdc.cancer.gov/
3.2 Understanding TCGA data
3.2.1 Data access
GDC provides the data with two access levels:
- Open: includes high level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements.
- Controlled: includes individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements
You can find more information about those two levels and how to get access to controlled data at: https://gdc.cancer.gov/access-data/data-access-processes-and-tools.
3.2.2 TCGA barcode description
Each TCGA sample has a unique identifier called TCGA barcode, which contains important information about each sample. A description of the barcode is shown below (Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).
Figure: TCGA barcode Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
You can find a table with all the code and meanings at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables. The Sample Type Codes is shown below:
Figure: TCGA samples type. Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
3.2.3 Data structure
In order to filter the data available in GDC some fields are available such as project (TCGA, TARGET, etc.), data category (Transcriptome Profiling, DNA methylation, Clinical, etc.), data type (Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value, etc.), experimental strategy (miRNA-Seq, RNA-Seq, etc.), Workflow Type, platform, access type and others.
Figure: Data category filter
Figure: Data type filter
In terms of data granularity, a project has data on several categories, each category contains several data types that might have been produced with different workflows, experimental strategy and platforms. In that way, if you select data type “Gene Expression Quantification” the data category will be Transcriptome Profiling.
Figure: Filtering by Gene Expression Quantification
You can find the entry possibilities for each filter at the repository page of the database at https://portal.gdc.cancer.gov/repository.
3.3 The SummarizedExperiment data structure
Before we start, it is important to know that the R/Bioconductor environment provides a data structure called SummarizedExperiment, which was created to handle both samples metadata (age, gender, etc), genomics data (i.e. DNA methylation beta value) and genomics metadata information (chr, start, end, gene symbol) in the same object. You can access samples metadata with colData function, genomics data with assays and genomics metadata with rowRanges.
Figure: Summarized Experiment data structure. Source: http://bioconductor.org/packages/SummarizedExperiment/
3.4 Loading required packages
library(TCGAbiolinks)
library(MultiAssayExperiment)
library(maftools)
library(dplyr)
library(ComplexHeatmap)In case those packages are not available, you can install those packages using BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap") # from bioconductor
if (!requireNamespace("TCGAbiolinks", quietly = TRUE))
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") # from github
4 Working with TCGA data
During this workshop we will be using open access data from the harmonized GDC database (hg38 version), which can be accessed at https://portal.gdc.cancer.gov/.
Through the next sections we will:
- Access TCGA clinical data
- Search, download and make an R object from TCGA data
- RNA-seq data
- DNA methylation
- Download samples metadata
- Download and visualize mutations
- Download and visualize copy number alterations
4.1 Clinical data
In GDC database the clinical data can be retrieved from different sources:
- indexed clinical: a refined clinical data that is created using the XML files.
- XML files: original source of the data
- BCR Biotab: tsv files parsed from XML files
There are two main differences between the indexed clinical and XML files:
- XML has more information: radiation, drugs information, follow-ups, biospecimen, etc. So the indexed one is only a subset of the XML files
- The indexed data contains the updated data with the follow up information. For example: if the patient is alive in the first time clinical data was collect and the in the next follow-up he is dead, the indexed data will show dead. The XML will have two fields, one for the first time saying he is alive (in the clinical part) and the follow-up saying he is dead.
Other useful clinical information available are:
- Tissue slide image
- Pathology report - Slide image
You can check examples in TCGAbiolinks vignette at https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/clinical.html.
Figure: Clinical data patient: TCGA-AA-3562. Souce:https://portal.gdc.cancer.gov/cases/127f738d-afa1-42f6-a9a0-f33806667b84
The clinical indexed data is the same you can see in the GDC data portal. Obs: the GDC API probides age_at_diagnosis in days.
# Same case as figure above
clinical %>%
dplyr::filter(submitter_id == "TCGA-AA-3562") %>%
t %>%
as.data.frameYou can also acess Biotab files which are TSV files created from the XML files (only works for TCGAbiolinks verion higher than 2.13.5).
query <- GDCquery(project = "TCGA-ACC",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)## [1] "clinical_omf_v4.0_acc" "clinical_follow_up_v4.0_nte_acc"
## [3] "clinical_radiation_acc" "clinical_patient_acc"
## [5] "clinical_nte_acc" "clinical_follow_up_v4.0_acc"
## [7] "clinical_drug_acc"
4.2 RNA-Seq data
The RNA-Seq pipeline produces raw counts, FPKM and FPKM-UQ quantifications and is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/.
Figure: gene expression pipeline. Souce: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
Figure: mRNA files. Souce: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
The following options are used to search mRNA results using TCGAbiolinks:
- data.category: “Transcriptome Profiling”
- data.type: “Gene Expression Quantification”
- workflow.type: “HTSeq - Counts”, “HTSeq - FPKM”, “HTSeq - FPKM-UQ”
Here is the example to download the raw counts, which can be used with (http://bioconductor.org/packages/DESeq2/) for differential expression analysis.
query.exp.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)Here is the example to download the FPKM-UQ counts.
query.exp.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM-UQ",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
fpkm.uq.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)4.3 Mutation
TCGAbiolinks has provided a few functions to download mutation data from GDC. There are two options to download the data:
- Use
GDCquery_Mafwhich will download MAF aligned against hg38.
This example will download MAF (mutation annotation files) for variant calling pipeline muse. Pipelines options are: muse, varscan2, somaticsniper, mutect. For more information please access GDC docs.
You can download the data using TCGAbiolinks GDCquery_Maf function.
Then visualize the results using the maftools package.
# using maftools for data summary
maftools.input <- maf %>% read.maf
# Check summary
plotmafSummary(maf = maftools.input,
rmOutlier = TRUE,
addStat = 'median',
dashboard = TRUE)# classifies Single Nucleotide Variants into Transitions and Transversions
titv = titv(maf = maftools.input,
plot = FALSE,
useSyn = TRUE)
plotTiTv(res = titv)You can extract sample summary from MAF object.
4.4 Copy number alteration data
The Copy Number Variation Analysis Pipeline is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
Numeric focal-level Copy Number Variation (CNV) values were generated with “Masked Copy Number Segment” files from tumor aliquots using GISTIC2 on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:
- Genes with focal CNV values smaller than -0.3 are categorized as a “loss” (-1)
- Genes with focal CNV values larger than 0.3 are categorized as a “gain” (+1)
- Genes with focal CNV values between and including -0.3 and 0.3 are categorized as “neutral” (0).
You can access “Gene Level Copy Number Scores” from GISTIC with the code below:
query <- GDCquery(project = "TCGA-GBM",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores",
access = "open")
GDCdownload(query)
scores <- GDCprepare(query)You can visualize the data using the R/Bioconductor package complexHeatmap.
scores.matrix <- scores %>%
dplyr::select(-c(1:3)) %>% # Removes metadata from the first 3 columns
as.matrix
rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)
# gain in more than 100 samples
gain.more.than.hundred.samples <- which(rowSums(scores.matrix == 1) > 100)
# loss in more than 100 samples
loss.more.than.hundred.samples <- which(rowSums(scores.matrix == -1) > 100)
lines.selected <- c(gain.more.than.hundred.samples,loss.more.than.hundred.samples)
Heatmap(scores.matrix[lines.selected,],
show_column_names = FALSE,
show_row_names = TRUE,
row_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))4.5 DNA methylation data
The processed DNA methylation data measure the level of methylation at known CpG sites as beta values, calculated from array intensities (Level 2 data) as Beta = \(M/(M+U)\) (Zhou, Laird, and Shen 2017) which ranges from 0 being unmethylated and 1 fully methylated.
More information about the DNA methylation pipeline is available at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/.
We will download two Glioblastoma (GBM) as a summarizedExperiment object.
query_met.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 27",
barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)## class: RangedSummarizedExperiment
## dim: 27578 2
## metadata(1): data_release
## assays(1): ''
## rownames(27578): cg00000292 cg00002426 ... cg27662877 cg27665659
## rowData names(7): Composite.Element.REF Gene_Symbol ...
## CGI_Coordinate Feature_Type
## colnames(2): TCGA-02-0116-01A-01D-0199-05
## TCGA-14-3477-01A-01D-0915-05
## colData names(113): sample patient ...
## subtype_Telomere.length.estimate.in.blood.normal..Kb.
## subtype_Telomere.length.estimate.in.tumor..Kb.
You can access the probes information with rowRanges.
You can access the samples metadata with colData.
You can access the DNA methylation levels with assay.
# plot 10 most variable probes
data.hg38 %>%
assay %>%
rowVars %>%
order(decreasing = TRUE) %>%
head(10) -> idx
pal_methylation <- colorRampPalette(c("#000436",
"#021EA9",
"#1632FB",
"#6E34FC",
"#C732D5",
"#FD619D",
"#FF9965",
"#FFD32B",
"#FFFC5A"))(100)
Heatmap(assay(data.hg38)[idx,],
show_column_names = TRUE,
show_row_names = TRUE,
name = "Methylation Beta-value",
row_names_gp = gpar(fontsize = 8),
column_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))4.6 ATAC-Seq data
Please, check our ATAC-seq Workshop: http://rpubs.com/tiagochst/atac_seq_workshop
5 Session information
## R version 3.5.3 Patched (2019-03-11 r76979)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ComplexHeatmap_2.1.0 dplyr_0.8.3
## [3] maftools_2.0.16 MultiAssayExperiment_1.8.3
## [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [7] BiocParallel_1.16.6 matrixStats_0.54.0
## [9] Biobase_2.42.0 GenomicRanges_1.34.0
## [11] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [13] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [15] TCGAbiolinks_2.13.5
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.4 circlize_0.4.6
## [3] aroma.light_3.12.0 NMF_0.21.0
## [5] plyr_1.8.4 selectr_0.4-1
## [7] ConsensusClusterPlus_1.46.0 lazyeval_0.2.2
## [9] splines_3.5.3 ggplot2_3.2.1
## [11] gridBase_0.4-7 sva_3.30.1
## [13] digest_0.6.20 foreach_1.4.7
## [15] htmltools_0.3.6 magrittr_1.5
## [17] memoise_1.1.0 cluster_2.0.7-1
## [19] doParallel_1.0.15 limma_3.38.3
## [21] Biostrings_2.50.2 readr_1.3.1
## [23] annotate_1.60.1 wordcloud_2.6
## [25] R.utils_2.9.0 prettyunits_1.0.2
## [27] colorspace_1.4-1 blob_1.2.0
## [29] rvest_0.3.4 ggrepel_0.8.1
## [31] xfun_0.8 crayon_1.3.4
## [33] RCurl_1.95-4.12 jsonlite_1.6
## [35] genefilter_1.64.0 zeallot_0.1.0
## [37] survival_2.43-3 zoo_1.8-6
## [39] iterators_1.0.12 glue_1.3.1
## [41] survminer_0.4.5 registry_0.5-1
## [43] gtable_0.3.0 zlibbioc_1.28.0
## [45] XVector_0.22.0 GetoptLong_0.1.7
## [47] shape_1.4.4 scales_1.0.0
## [49] DESeq_1.34.1 rngtools_1.3.1
## [51] DBI_1.0.0 edgeR_3.24.3
## [53] bibtex_0.4.2 ggthemes_4.2.0
## [55] Rcpp_1.0.2 xtable_1.8-4
## [57] progress_1.2.2 clue_0.3-57
## [59] bit_1.1-14 matlab_1.0.2
## [61] km.ci_0.5-2 httr_1.4.1
## [63] RColorBrewer_1.1-2 pkgconfig_2.0.2
## [65] XML_3.98-1.20 R.methodsS3_1.7.1
## [67] locfit_1.5-9.1 reshape2_1.4.3
## [69] tidyselect_0.2.5 rlang_0.4.0
## [71] AnnotationDbi_1.44.0 munsell_0.5.0
## [73] tools_3.5.3 downloader_0.4
## [75] generics_0.0.2 RSQLite_2.1.2
## [77] broom_0.5.2 evaluate_0.14
## [79] stringr_1.4.0 yaml_2.2.0
## [81] knitr_1.24 bit64_0.9-7
## [83] survMisc_0.5.5 purrr_0.3.2
## [85] EDASeq_2.16.3 nlme_3.1-137
## [87] R.oo_1.22.0 xml2_1.2.2
## [89] biomaRt_2.38.0 compiler_3.5.3
## [91] curl_4.0 png_0.1-7
## [93] ggsignif_0.6.0 tibble_2.1.3
## [95] geneplotter_1.60.0 stringi_1.4.3
## [97] GenomicFeatures_1.34.6 lattice_0.20-38
## [99] Matrix_1.2-15 KMsurv_0.1-5
## [101] vctrs_0.2.0 pillar_1.4.2
## [103] GlobalOptions_0.1.0 data.table_1.12.2
## [105] bitops_1.0-6 rtracklayer_1.42.2
## [107] R6_2.4.0 latticeExtra_0.6-28
## [109] hwriter_1.3.2 ShortRead_1.40.0
## [111] gridExtra_2.3 codetools_0.2-16
## [113] assertthat_0.2.1 pkgmaker_0.27
## [115] rjson_0.2.20 withr_2.1.2
## [117] GenomicAlignments_1.18.1 Rsamtools_1.34.1
## [119] GenomeInfoDbData_1.2.0 mgcv_1.8-27
## [121] hms_0.5.0 prettydoc_0.3.0
## [123] tidyr_0.8.3 rmarkdown_1.14
## [125] ggpubr_0.2.2
6 Workshop materials
6.1 Source code
All source code used to produce the workshops are available at https://github.com/tiagochst/ELMER_workshop_2019.
6.2 Workshops HTMLs
- ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
- ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
- ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop
6.3 Workshop videos
We have a set of recorded videos, explaining some of the workshops.
- All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
- ELMER algorithm: https://youtu.be/PzC31K9vfu0
- ELMER data: https://youtu.be/R00wG--tGo8
- ELMER analysis part1 : https://youtu.be/bcd4uyxrZCw
- ELMER analysis part2: https://youtu.be/vcJ_DSCt4Mo
- ELMER summarizing several analysis: https://youtu.be/moLeik7JjLk
- ATAC-Seq workshop: https://youtu.be/3ftZecz0lU4
References
Colaprico, Antonio, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S Sabedot, et al. 2015. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of Tcga Data.” Nucleic Acids Research 44 (8): e71–e71.
Zhou, Wanding, Peter W Laird, and Hui Shen. 2017. “Comprehensive Characterization, Annotation and Innovative Use of Infinium Dna Methylation Beadchip Probes.” Nucleic Acids Research 45 (4): e22–e22.